M multiprocessors using microbenchmarks and scientific applications
Ravi Iyer, Nancy M. Amato, Lawrence Rauchwerger, Laxmi Bhuyan

May 1999 Proceedings of the 13th international conference on Supercomputing 

Publisher: ACM Press 
Parthasarathy Ranganathan, Kourosh Gharachorloo, Sarita V. Adve, Luiz Andre Barroso
October 1998 ACM SIGPLAN Notices, ACM SIGOPS Operating Systems Review,
Proceedings of the eighth international conference on Architectural

Database applications such as online transaction processing (OLTP) and decision support
systems (DSS) constitute the largest and fastest-growing segment of the market for 
multiprocessor servers. However, most current system designs have been optimized to 
perform well on scientific and engineering workloads. Given the radically different 
behavior of database workloads (especially OLTP), it is important to re-evaluate key 
system design decisions in the context of this important class of applicatio ... 

Architecture and ...

Kourosh Gharachorloo, Madhu Sharma, Simon Steely, Stephen Van Doren 
November 2000 ACM SIGARCH Computer Architecture News, ACM SIGOPS Operating
Systems Review, Proceedings of the ninth international conference
on Architectural support for programming languages and operating
systems ASPLOS-IX, Volume 28, 34 Issue 5, 5

Publisher: ACM Press 

This paper describes the architecture and implementation of the AlphaServer GS320, a
cache-coherent non-uniform memory access multiprocessor developed at Compaq. The
AlphaServer GS320 architecture is specifically targeted at medium-scale multiprocessing
with 32 to 64 processors. Each node in the design consists of four Alpha 21264
processors, up to 32GB of coherent memory, and an aggressive I/O subsystem. The
current implementation supports up to 8 such nodes for a total of 32 processors. While
s ...

Parallel logic programming systems

Jacques Chassin de Kergommeaux, Philippe Codognet 

September 1994 ACM Computing Surveys (CSUR), volume 26 issue 3 

Publisher: ACM Press 

Parallelizing logic programming has attracted much interest in the research community,
because of the intrinsic OR- and AND-parallelisms of logic programs. One research stream 
aims at transparent exploitation of parallelism in existing logic programming languages 
such as Prolog, while the family of concurrent logic languages develops language 
constructs allowing programmers to express the concurrency— that is, the communication 
and synchronization between parallel processes— withi ... 

Keywords: AND-parallelism, OR-parallelism, Prolog, Warren Abstract Machine, binding 
arrays, concurrent constraint programming, constraints, guard, hash windows, load 
balancing, massive parallelism, memory management, multisequential implementation 
techniques, nondeterminism, scheduling parallel tasks, static analysis 

annual international symposium on Computer architecture ISCA '97, volume
Publisher: ACM Press

Current shared-memory multiprocessors are inherently vulnerable to faults: any
significant hardware or system software fault causes the entire system to fail. Unless
provisions are made to limit the impact of faults, users will perceive a decrease in
reliability when they entrust their applications to larger machines. This paper shows that 
fault containment techniques can be effectively applied to scalable shared-memory 
multiprocessors to reduce the reliability problems created by increased mach ... 

SMTp: An Architecture for Next-generation Scalable Multi-threading
Mainak Chaudhuri, Mark Heinrich 

March 2004 ACM SIGARCH Computer Architecture News, Proceedings of the 31st
Publisher: IEEE Computer Society, ACM Press

We introduce the SMTp architecture: an SMT processor augmented with a coherence
protocol thread context, that together with a standard integrated memory controller can
enable the design of (among other possibilities) scalable cache-coherent hardware
distributed shared memory (DSM) machines from commodity nodes. We describe the minor
changes needed to a conventional out-of-order multi-threaded core to realize SMTp,
discussing issues related to both deadlock avoidance and performance. We then
compare SMTp p ...

The SGI Origin: a ccNUMA ...
James Laudon, Daniel Lenoski 

May 1997 ACM SIGARCH Computer Architecture News, Proceedings of the 24th
annual international symposium on Computer architecture ISCA '97, volume
25 issue 2

Publisher: ACM Press 

The SGI Origin 2000 is a cache-coherent non-uniform memory access (ccNUMA)
multiprocessor designed and manufactured by Silicon Graphics, Inc. The Origin system
was designed from the ground up as a multiprocessor capable of scaling to both small and
large processor counts without any bandwidth, latency, or cost cliffs. The Origin system
consists of up to 512 nodes interconnected by a scalable Craylink network. Each node
consists of one or two R10000 processors, up to 4 GB of coherent memory, and ...

Parallel and dis ...

Russell M. Clapp, Trevor Mudge 

January 1990 ACM SIGAda Ada Letters, Proceedings of the working group on Ada
performance issues 1990, volume X issue 3
Publisher: ACM Press 




Efficient shared mem ...
Leonidas I. Kontothanassis, Michael L. Scott
September 1995 ACM SIGARCH Computer Architecture News, Volume 23 issue 4

Publisher: ACM Press 


Shared memory is widely regarded as a more intuitive model than message passing for 
the development of parallel programs. A shared memory model can be provided by 
hardware, software, or some combination of both. One of the most important problems to 
be solved in shared memory environments is that of cache coherence. Experience 
indicates, unsurprisingly, that hardware-coherent multiprocessors greatly outperform 
distributed shared-memory (DSM) emulations on message-passing hardware.
Intermediate o ... 

Space-time scheduling ...
Walter Lee, Rajeev Barua, Matthew Frank, Devabhaktuni Srikrishna, Jonathan Babb, Vivek
Sarkar, Saman Amarasinghe 

October 1998 ACM SIGPLAN Notices, ACM SIGOPS Operating Systems Review,
Proceedings of the eighth international conference on Architectural
support for programming languages and operating systems ASPLOS-VIII,
Volume 33, 32 Issue 11, 5

Publisher: ACM Press 


Increasing demand for both greater parallelism and faster clocks dictate that future
generation architectures will need to decentralize their resources and eliminate primitives
that require single cycle global communication. A Raw microprocessor distributes all of its
resources, including instruction streams, register files, memory ports, and ALUs, over a
pipelined two-dimensional mesh interconnect, and exposes them fully to the compiler.
Because communication in Raw machines is distributed, com ...

Access normalization ...
Wei Li, Keshav Pingali

September 1992 ACM SIGPLAN Notices, Proceedings of the fifth international
conference on Architectural support for programming languages and
operating systems ASPLOS-V, volume 27 issue 9
Publisher: ACM Press


In scalable parallel machines, processors can make local memory accesses much faster
than they can make remote memory accesses. In addition, when a number of remote
accesses must be made, it is usually more efficient to use block transfers of data rather
than to use many small messages. To run well on such machines, software must exploit
these features. We believe it is too onerous for a programmer to do this by hand, so we
have been exploring the use of restructuring compiler technology for ...

Systems support for scalable data mining
William A. Maniatty, Mohammed J. Zaki

December 2000 ACM SIGKDD Explorations Newsletter, volume 2 issue 2 

Publisher: ACM Press 

Manohar Rao, Zary Segall, Dalibor Vrsalovic

November 1990 Proceedings of the 1990 ACM/IEEE conference on Supercomputing 
Publisher: IEEE Computer Society 


Most supercomputers today are parallel computers. In this paper, an approach for 
efficiently mapping parallel applications onto parallel MIMD machine architectures is 
introduced. The applicability of this approach to uniform memory access multiprocessors 
is demonstrated. The paper shows that an intermediate layer of abstraction between the 
application level and the parallel architecture level is conducive to not only a better 
software productivity, but also to performance efficient programs. The ... 

15 Analytic evaluation of shared-memory systems with ILP processors
Daniel J. Sorin, Vijay S. Pai, Sarita V. Adve, Mary K. Vernon, David A. Wood
April 1998 ACM SIGARCH Computer Architecture News, Proceedings of the 25th
annual international symposium on Computer architecture ISCA '98, volume
26 issue 3

Publisher: IEEE Computer Society, ACM Press 


This paper develops and validates an analytical model for evaluating various types of
architectural alternatives for shared-memory systems with processors that aggressively
exploit instruction-level parallelism. Compared to simulation, the analytical model is many
orders of magnitude faster to solve, yielding highly accurate system performance
estimates in seconds. The model input parameters characterize the ability of an application
to exploit instruction-level parallelism as well as the interac ...

16 Data distribution support on distributed shared memory multiprocessors
Rohit Chandra, Ding-Kai Chen, Robert Cox, Dror E. Maydan, Nenad Nedeljkovic, Jennifer M.
Anderson

May 1997 ACM SIGPLAN Notices, Proceedings of the ACM SIGPLAN 1997 conference
on Programming language design and implementation PLDI '97, volume 32
issue 5

Publisher: ACM Press 


Cache-coherent multiprocessors with distributed shared memory are becoming 
increasingly popular for parallel computing. However, obtaining high performance on 
these machines requires that an application execute with good data locality. In addition to
making effective use of caches, it is often necessary to distribute data structures across
the local memories of the processing nodes, thereby reducing the latency of cache 
misses. We have designed a set of abstractions for performing data distributio ... 

Comparative performance ... architectures
Per Stenström, Truman Joe, Anoop Gupta

April 1992 ACM SIGARCH Computer Architecture News, Proceedings of the 19th
annual international symposium on Computer architecture ISCA '92, volume
20 issue 2

Publisher: ACM Press 


Two interesting variations of large-scale shared-memory machines that have recently 
emerged are cache-coherent non-uniform-memory-access machines (CC-NUMA) and 
cache-only memory architectures (COMA). They both have distributed main memory and 
use directory-based cache coherence. Unlike CC-NUMA, however, COMA machines 
automatically migrate and replicate data at the main-memory level in cache-line sized 
chunks. This paper compares the performance of these two classes ... 

18 Unified compilation techniques for shared and distributed address space machines 
Chau-Wen Tseng, Jennifer M. Anderson, Saman P. Amarasinghe, Monica S. Lam 
July 1995 Proceedings of the 9th international conference on Supercomputing 

Publisher: ACM Press 

David Bauer, Garrett Yaun, Christopher D. Carothers, Murat Yuksel, Shivkumar
Kalyanaraman

June 2005 Proceedings of the 19th Workshop on Principles of Advanced and
Distributed Simulation PADS '05
Publisher: IEEE Computer Society


In this paper we introduce a new concept, network atomic operations (NAOs) to create a 
zero-cost consistent cut. Using NAOs, we define a wall-clock-time driven GVT algorithm 
called Seven O'Clock that is an extension of Fujimoto's shared memory GVT algorithm.
Using this new GVT algorithm, we report good optimistic parallel performance on a cluster 
of state-of-the-art Itanium-II quad processor systems for both benchmark applications 
such as PHOLD and real-world applications such as a large-scale T ... 

Early Experience with Scie ...

Wendell Anderson, Preston Briggs, C. Stephen Hellberg, Daryl W. Hess, Alexei Khokhlov,
Marco Lanzagorta, Robert Rosenberg

November 2003 Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Publisher: IEEE Computer Society


We describe our experiences porting and tuning three scientific programs to the Cray
MTA-2, paying particular attention to the problems posed by I/O. We have measured the
performance of each of the programs over many different machine configurations and we 
report on the scalability of each program. In addition, we compare the performance of the 
MTA with that of an SGI Origin running all three programs. 